MERS exploration server

Warning: this site is under development!
Warning: this site was generated automatically from raw corpora.
The information has therefore not been validated.

Indexing k-mers in linear space for quality value compression.

Internal identifier: 000524 (Main/Exploration); previous: 000523; next: 000525

Authors: Yoshihiro Shibuya [Italy]; Matteo Comin [Italy]

Source: Journal of bioinformatics and computational biology, 2019

RBID : pubmed:31856669

Abstract

Many bioinformatics tools heavily rely on k-mer dictionaries to describe the composition of sequences and allow for faster reference-free algorithms or look-ups. Unfortunately, naive k-mer dictionaries are very memory-inefficient, requiring a very large amount of storage space to save each k-mer. This problem is generally worsened by the necessity of an index for fast queries. In this work, we discuss how to build an indexed linear reference containing a set of input k-mers and its application to the compression of quality scores in FASTQ files. Most of the entropy of sequencing data lies in the quality scores, and thus they are difficult to compress. Here, we present an application to improve the compressibility of quality values while preserving the information for SNP calling. We show how a dictionary of significant k-mers, obtained from SNP databases and multiple genomes, can be indexed in linear space and used to improve the compression of quality values. Availability: The software is freely available at https://github.com/yhhshb/yalff.

DOI: 10.1142/S0219720019400110
PubMed: 31856669
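
The storage argument in the abstract can be illustrated with a toy sketch. It is not the authors' method and is not taken from the yalff software. It only contrasts storing every distinct k-mer separately with keeping one linear string plus an index of integer positions. The function names (`distinct_kmers`, `build_index`, `contains`) and the example reference string are assumptions made for this illustration.

```python
def distinct_kmers(reference, k):
    """Naive dictionary: every distinct k-mer stored as its own string."""
    return {reference[i:i + k] for i in range(len(reference) - k + 1)}


def build_index(reference, k):
    """Index over a linear reference: k-mer start positions, sorted by the
    k-mer each position points to. Only integers are stored; the k-mer text
    itself lives once, inside the reference."""
    positions = range(len(reference) - k + 1)
    return sorted(positions, key=lambda p: reference[p:p + k])


def contains(reference, index, kmer):
    """Binary search over the sorted positions (query length must equal k)."""
    k = len(kmer)
    lo, hi = 0, len(index)
    while lo < hi:
        mid = (lo + hi) // 2
        if reference[index[mid]:index[mid] + k] < kmer:
            lo = mid + 1
        else:
            hi = mid
    return lo < len(index) and reference[index[lo]:index[lo] + k] == kmer


if __name__ == "__main__":
    reference = "ACGTACGTGAC"  # hypothetical toy sequence
    k = 4
    naive = distinct_kmers(reference, k)
    index = build_index(reference, k)
    print("characters stored naively:", len(naive) * k)
    print("characters in linear reference:", len(reference))
    print("GTGA present:", contains(reference, index, "GTGA"))
    print("AAAA present:", contains(reference, index, "AAAA"))
```

Because consecutive k-mers overlap in k-1 characters, a string of length n holds n-k+1 k-mers in n characters, whereas separate storage costs k characters per k-mer. A plain sorted position list is still far from a compact index, so this sketch should be read only as an intuition for why a linear layout helps.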



The document in XML format

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">Indexing
<i>k</i>
-mers in linear space for quality value compression.</title>
<author>
<name sortKey="Shibuya, Yoshihiro" sort="Shibuya, Yoshihiro" uniqKey="Shibuya Y" first="Yoshihiro" last="Shibuya">Yoshihiro Shibuya</name>
<affiliation wicri:level="1">
<nlm:affiliation>Department of Information Engineering, University of Padua, via Gradenigo 6B, Padua, Italy.</nlm:affiliation>
<country xml:lang="fr">Italie</country>
<wicri:regionArea>Department of Information Engineering, University of Padua, via Gradenigo 6B, Padua</wicri:regionArea>
<wicri:noRegion>Padua</wicri:noRegion>
</affiliation>
</author>
<author>
<name sortKey="Comin, Matteo" sort="Comin, Matteo" uniqKey="Comin M" first="Matteo" last="Comin">Matteo Comin</name>
<affiliation wicri:level="1">
<nlm:affiliation>Department of Information Engineering, University of Padua, via Gradenigo 6B, Padua, Italy.</nlm:affiliation>
<country xml:lang="fr">Italie</country>
<wicri:regionArea>Department of Information Engineering, University of Padua, via Gradenigo 6B, Padua</wicri:regionArea>
<wicri:noRegion>Padua</wicri:noRegion>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">PubMed</idno>
<date when="2019">2019</date>
<idno type="RBID">pubmed:31856669</idno>
<idno type="pmid">31856669</idno>
<idno type="doi">10.1142/S0219720019400110</idno>
<idno type="wicri:Area/PubMed/Corpus">000322</idno>
<idno type="wicri:explorRef" wicri:stream="PubMed" wicri:step="Corpus" wicri:corpus="PubMed">000322</idno>
<idno type="wicri:Area/PubMed/Curation">000322</idno>
<idno type="wicri:explorRef" wicri:stream="PubMed" wicri:step="Curation">000322</idno>
<idno type="wicri:Area/PubMed/Checkpoint">000516</idno>
<idno type="wicri:explorRef" wicri:stream="Checkpoint" wicri:step="PubMed">000516</idno>
<idno type="wicri:Area/Ncbi/Merge">002442</idno>
<idno type="wicri:Area/Ncbi/Curation">002442</idno>
<idno type="wicri:Area/Ncbi/Checkpoint">002442</idno>
<idno type="wicri:Area/Main/Merge">000527</idno>
<idno type="wicri:Area/Main/Curation">000524</idno>
<idno type="wicri:Area/Main/Exploration">000524</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en">Indexing
<i>k</i>
-mers in linear space for quality value compression.</title>
<author>
<name sortKey="Shibuya, Yoshihiro" sort="Shibuya, Yoshihiro" uniqKey="Shibuya Y" first="Yoshihiro" last="Shibuya">Yoshihiro Shibuya</name>
<affiliation wicri:level="1">
<nlm:affiliation>Department of Information Engineering, University of Padua, via Gradenigo 6B, Padua, Italy.</nlm:affiliation>
<country xml:lang="fr">Italie</country>
<wicri:regionArea>Department of Information Engineering, University of Padua, via Gradenigo 6B, Padua</wicri:regionArea>
<wicri:noRegion>Padua</wicri:noRegion>
</affiliation>
</author>
<author>
<name sortKey="Comin, Matteo" sort="Comin, Matteo" uniqKey="Comin M" first="Matteo" last="Comin">Matteo Comin</name>
<affiliation wicri:level="1">
<nlm:affiliation>Department of Information Engineering, University of Padua, via Gradenigo 6B, Padua, Italy.</nlm:affiliation>
<country xml:lang="fr">Italie</country>
<wicri:regionArea>Department of Information Engineering, University of Padua, via Gradenigo 6B, Padua</wicri:regionArea>
<wicri:noRegion>Padua</wicri:noRegion>
</affiliation>
</author>
</analytic>
<series>
<title level="j">Journal of bioinformatics and computational biology</title>
<idno type="eISSN">1757-6334</idno>
<imprint>
<date when="2019" type="published">2019</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<textClass></textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">Many bioinformatics tools heavily rely on
<mml:math>
<mml:mi>k</mml:mi>
</mml:math>
-mer dictionaries to describe the composition of sequences and allow for faster reference-free algorithms or look-ups. Unfortunately, naive
<mml:math>
<mml:mi>k</mml:mi>
</mml:math>
-mer dictionaries are very memory-inefficient, requiring a very large amount of storage space to save each
<mml:math>
<mml:mi>k</mml:mi>
</mml:math>
-mer. This problem is generally worsened by the necessity of an index for fast queries. In this work, we discuss how to build an indexed linear reference containing a set of input
<mml:math>
<mml:mi>k</mml:mi>
</mml:math>
-mers and its application to the compression of quality scores in FASTQ files. Most of the entropy of sequencing data lies in the quality scores, and thus they are difficult to compress. Here, we present an application to improve the compressibility of quality values while preserving the information for SNP calling. We show how a dictionary of significant
<mml:math>
<mml:mi>k</mml:mi>
</mml:math>
-mers, obtained from SNP databases and multiple genomes, can be indexed in linear space and used to improve the compression of quality values. Availability: The software is freely available at https://github.com/yhhshb/yalff.</div>
</front>
</TEI>
<affiliations>
<list>
<country>
<li>Italie</li>
</country>
</list>
<tree>
<country name="Italie">
<noRegion>
<name sortKey="Shibuya, Yoshihiro" sort="Shibuya, Yoshihiro" uniqKey="Shibuya Y" first="Yoshihiro" last="Shibuya">Yoshihiro Shibuya</name>
</noRegion>
<name sortKey="Comin, Matteo" sort="Comin, Matteo" uniqKey="Comin M" first="Matteo" last="Comin">Matteo Comin</name>
</country>
</tree>
</affiliations>
</record>
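
As a rough sketch of how a record like the one above might be read outside Dilib, the following Python snippet extracts the title, authors and DOI with the standard library. The record uses the `wicri:`, `nlm:` and `mml:` prefixes without declaring them, so a strict XML parser rejects it as it stands. The snippet works around this by injecting placeholder namespace declarations on the root element. The placeholder URIs, the helper names and the abridged `SAMPLE` record are assumptions made for this illustration; they are not part of Wicri or Dilib.

```python
import xml.etree.ElementTree as ET

# Placeholder URIs (assumption): the record never declares these prefixes.
PLACEHOLDER_NS = {
    "wicri": "urn:example:wicri",
    "nlm": "urn:example:nlm",
    "mml": "urn:example:mml",
}

# Abridged copy of the record, kept short for the example.
SAMPLE = """<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">Indexing <i>k</i>-mers in linear space for quality value compression.</title>
<author>
<name sortKey="Shibuya, Yoshihiro">Yoshihiro Shibuya</name>
<affiliation wicri:level="1">
<country xml:lang="fr">Italie</country>
</affiliation>
</author>
<author>
<name sortKey="Comin, Matteo">Matteo Comin</name>
</author>
</titleStmt>
<publicationStmt>
<idno type="pmid">31856669</idno>
<idno type="doi">10.1142/S0219720019400110</idno>
</publicationStmt>
</fileDesc>
</teiHeader>
</TEI>
</record>"""


def parse_record(xml_text):
    """Declare the placeholder namespaces on <record>, then parse."""
    decls = " ".join(f'xmlns:{p}="{u}"' for p, u in PLACEHOLDER_NS.items())
    patched = xml_text.replace("<record>", f"<record {decls}>", 1)
    return ET.fromstring(patched)


def text_of(element):
    """Concatenate all nested text and normalise whitespace."""
    return " ".join("".join(element.itertext()).split())


def summarize(record):
    """Collect the title, author names and DOI from the TEI header."""
    title = record.find(".//titleStmt/title")
    authors = record.findall(".//titleStmt/author/name")
    doi = record.find(".//publicationStmt/idno[@type='doi']")
    return {
        "title": text_of(title),
        "authors": [text_of(a) for a in authors],
        "doi": doi.text.strip(),
    }


if __name__ == "__main__":
    print(summarize(parse_record(SAMPLE)))
```

In the indented form shown in the record, the title is split across lines around `<i>k</i>`, so the whitespace normalisation in `text_of` would yield "k -mers" with a space. The abridged sample keeps the title on one line to avoid that.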

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Sante/explor/MersV1/Data/Main/Exploration
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000524 | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/Main/Exploration/biblio.hfd -nk 000524 | SxmlIndent | more

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Sante
   |area=    MersV1
   |flux=    Main
   |étape=   Exploration
   |type=    RBID
   |clé=     pubmed:31856669
   |texte=   Indexing k-mers in linear space for quality value compression.
}}

To generate wiki pages

HfdIndexSelect -h $EXPLOR_AREA/Data/Main/Exploration/RBID.i   -Sk "pubmed:31856669" \
       | HfdSelect -Kh $EXPLOR_AREA/Data/Main/Exploration/biblio.hfd   \
       | NlmPubMed2Wicri -a MersV1 

This area was generated with Dilib version V0.6.33.
Data generation: Mon Apr 20 23:26:43 2020. Site generation: Sat Mar 27 09:06:09 2021